CUDIA: Probabilistic Cross-level Imputation using Individual Side Information

Authors

  • Yubin Park
  • Joydeep Ghosh
Abstract

Due to privacy or legal issues, publishing data in aggregate form is common practice in healthcare and medical research. However, when used to uncover valuable individual-level relationships from aggregate data, many data mining algorithms suffer from aggregation bias and information loss, or require rather strict assumptions that are usually unverifiable. Furthermore, even when individual-level data are available, many healthcare studies are performed with a pre-specified goal, so the limited scope of variables constrains the range of research questions. How can one run data mining procedures on such data, where different variables are available at different levels of aggregation or granularity? In this paper, we seek better utilization of variably aggregated datasets, possibly drawn from different sources. By modeling the generative process of such datasets using a Bayesian directed graphical model, we propose a novel “cross-level” imputation technique. The imputation is based on the underlying data distribution and is shown to be unbiased. It can be further used in subsequent predictive modeling, where it yields better performance than simply imputing the aggregate information as is. Experimental results on a simulated dataset and the Behavioral Risk Factor Surveillance System (BRFSS) dataset illustrate the generality and capabilities of the proposed framework.
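
To make the contrast in the abstract concrete, here is a minimal toy sketch of imputing an aggregate "as is" versus refining it with individual-level side information. It is not the CUDIA model: it swaps in a simple mean-preserving ecological regression, and every name and number in it (`x`, `z`, `n_groups`, the simulated slope of 2.0) is an assumption made for illustration. It also assumes the between-group slope equals the within-group slope, which is exactly the kind of strict assumption the abstract cautions about.

```python
import numpy as np

rng = np.random.default_rng(0)
n_groups, group_size = 50, 40  # hypothetical sizes, chosen for illustration

# Simulated individuals: x is observed per person (side information),
# z is private and only published as a per-group average.
group = np.repeat(np.arange(n_groups), group_size)
group_center = rng.normal(0.0, 2.0, size=n_groups)
x = rng.normal(group_center[group], 1.0)
z = 2.0 * x + rng.normal(0.0, 0.5, size=x.size)

# Published aggregates: per-group means of z and of x.
counts = np.bincount(group, minlength=n_groups)
z_bar = np.bincount(group, weights=z, minlength=n_groups) / counts
x_bar = np.bincount(group, weights=x, minlength=n_groups) / counts

# Baseline: give every individual the group's aggregate value as is.
naive = z_bar[group]

# Toy stand-in for cross-level imputation: fit a slope on group means only,
# then shift each individual's value by their deviation in x.
slope, intercept = np.polyfit(x_bar, z_bar, deg=1)
side_info = z_bar[group] + slope * (x - x_bar[group])

# Compare against the (normally unobservable) true z.
mse_naive = np.mean((z - naive) ** 2)
mse_side_info = np.mean((z - side_info) ** 2)

# The refined imputation still reproduces the published group means.
imputed_bar = np.bincount(group, weights=side_info, minlength=n_groups) / counts

print(f"MSE, aggregate as is: {mse_naive:.3f}")
print(f"MSE, with side info:  {mse_side_info:.3f}")
```

In this simulation the side-information imputation has lower error because the within-group variation in `z` is largely explained by `x`; on real data that relationship would need to be modeled and checked rather than assumed.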


Similar resources

Inbred Strain Variant Database (ISVdb): A Repository for Probabilistically Informed Sequence Differences Among the Collaborative Cross Strains and Their Founders

The Collaborative Cross (CC) is a panel of recently established multiparental recombinant inbred mouse strains. For the CC, as for any multiparental population (MPP), effective experimental design and analysis benefit from detailed knowledge of the genetic differences between strains. Such differences can be directly determined by sequencing, but until now whole-genome sequencing was not public...


Bias Characterization in Probabilistic Genotype Data and Improved Signal Detection with Multiple Imputation

Missing data are an unavoidable component of modern statistical genetics. Different array or sequencing technologies cover different single nucleotide polymorphisms (SNPs), leading to a complicated mosaic pattern of missingness where both individual genotypes and entire SNPs are sporadically absent. Such missing data patterns cannot be ignored without introducing bias, yet cannot be inferred ex...


Estimation of genotype imputation accuracy using reference populations with varying degrees of relationship and marker density panel

Genotype imputation from low-density to high-density (SNP) chips is an important step before applying genomic selection, because denser chips can provide more reliable genomic predictions. In the current research, the accuracy of genotype imputation from low and moderate-density panels (5K and 50K) to high-density panels in the purebred and crossbred populations was assessed. The simulated popu...


Measuring Disclosure Risk for a Synthetic Data Set Created Using Multiple Methods

Government agencies must simultaneously maintain confidentiality of individual records and disseminate useful microdata. We propose a method to create synthetic data that combines quantile regression, hot deck imputation, and rank swapping. The result from implementation of the proposed procedure is a releasable data set containing original values for a few key variables, synthetic quantile reg...


A Probabilistic Imputation Framework for Regression Analysis using Variably Aggregated, Multi-source Healthcare Data

Many measures of healthcare delivery or quality are not publicly available at the individual patient or hospital level largely due to privacy restrictions, legal issues or reporting norms. Instead, such measures are provided at a higher or more aggregated level, such as state-level, county-level summaries or averages over health zones (HRRs and HSAs). Such levels constitute partitionings of the...




Publication date: 2011